Efficient Mining Frequent Closed Discriminative Biclusters by Sample-Growth: The FDCluster Approach

Authors

  • Miao Wang
  • Xuequn Shang
  • Shaohua Zhang
  • Zhanhuai Li
Abstract

DNA microarray technology has generated a large amount of gene expression data. Biclustering is a methodology that clusters the condition set and the gene set simultaneously. It finds clusters of genes that share similar characteristics together with the biological conditions that give rise to these similarities. Almost all current biclustering algorithms find biclusters in a single microarray dataset. To reduce the influence of noise and find more biologically meaningful biclusters, the authors propose the FDCluster algorithm, which mines frequent closed discriminative biclusters in multiple microarray datasets. FDCluster uses the Apriori property and several novel pruning techniques to mine biclusters efficiently. To improve space efficiency, FDCluster also uses several techniques to generate frequent closed biclusters without maintaining candidates in memory. The experimental results show that FDCluster is more effective than traditional methods on both a single microarray dataset and multiple microarray datasets. The paper tests biological significance using GO to show that the proposed method is able to produce biologically relevant biclusters.
DOI: 10.4018/jkdb.2010100104. International Journal of Knowledge Discovery in Bioinformatics, 1(4), 69-88, October-December 2010. Copyright © 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
[...]munity, the information embedded in most of these data has not yet been completely exploited. Recently, DNA microarray technology has generated a large amount of gene expression data, which is typically represented by a matrix where each cell represents the expression level of a gene under an experimental condition. How to use these data to reveal the function and biological process of genes poses a great challenge for analysis algorithms.
Various data mining techniques have been employed to infer useful biological information from the huge and rapidly growing body of microarray data. One widely used method for inferring relationships among genes in a microarray dataset is frequent pattern mining. Based on the characteristics of microarray data, Pan et al. (2004) and Cong et al. (2004) proposed using a condition enumeration method to exploit gene patterns. However, both algorithms need to maintain the candidate patterns in memory, which limits their scalability. Association rule mining is another way to analyze gene expression data (Becquet et al., 2003; Creighton & Hanash, 2003; McIntosh & Chawla, 2007; Cong et al., 2004), and it can discover relationships among genes. However, it can only identify genes whose expression levels are correlated across some conditions; it cannot reveal the regulatory relations among genes. Using association rules to exploit regulatory modules therefore has its limitations (Yeung et al., 2004).
How can genes with similar behavior with respect to different samples be identified? Biclustering (Cheng & Church, 2000) is a methodology that clusters the condition set and the gene set simultaneously. It finds clusters of genes that share similar characteristics together with the biological conditions that give rise to these similarities. The main advantage of biclustering is that it mines modules over genes and experimental conditions simultaneously; another advantage is that it applies to the original data rather than to discretized data (Zhao & Zaki, 2005).
However, mining microarray data for biclusters presents the following four challenges. First, biclustering is NP-hard (Cheng & Church, 2000). Second, because biclustering deals with the original data, it must cope with the noise sensitivity of microarray datasets. Third, the biclustering method should allow overlapping biclusters that share some genes or conditions, which increases the complexity of the algorithm. Finally, the biclustering method should be flexible enough to handle different types of biclusters. Madeira and Oliveira (2004) classified biclusters into four categories: (i) constant value biclusters, (ii) constant row or column biclusters, (iii) biclusters with coherent values, where each row and column is obtained by addition or multiplication of the previous row and column by a constant value, and (iv) biclusters with coherent evolutions, where the direction of change of values is important rather than the coherence of the values (Pandey et al., 2009).
Faced with the first three challenges, some algorithms use greedy or heuristic approaches to mine biclusters. Cheng and Church (2000) employed a greedy node deletion algorithm that searches for biclusters with a low mean squared residue. Once a bicluster is found, its entries are replaced by random numbers and the procedure is repeated iteratively. Since then, many greedy algorithms have been proposed (Ben et al., 2003; Yang et al., 2003; Liu & Wang, 2007; Cheng et al., 2008; Teng & Chan, 2008; Dharan & Nair, 2009). A recent review of biclustering algorithms for biological data analysis can be found in (Madeira & Oliveira, 2004). Although these algorithms may improve the results, their efficiency is limited. Another biclustering algorithm, MicroCluster (Zhao & Zaki, 2005), uses a weighted directed range multigraph to generate biclusters deterministically. The experimental results show that this algorithm is very efficient. It also uses deletion and merging methods to reduce the influence of noise. However, the main drawback of MicroCluster is that all the candidate biclusters need to be maintained in memory, which increases computational complexity and memory consumption.
The last challenge is difficult to handle, since a given clustering formulation may not be suitable for finding some or all types of biclusters. For instance, SAMBA (Subramanian et al., 2005) is designed to find constant value biclusters, while Cheng and Church's method (Cheng & Church, 2000) can find both constant value and constant row or column biclusters. Biclusters with coherent trends of up or down regulation [...]
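
The mean squared residue that Cheng and Church's greedy search tries to keep low can be illustrated with a short sketch. The code below is not part of FDCluster or of the paper; the function name `mean_squared_residue`, the toy matrix `expression`, and its values are assumptions made only for illustration. It scores a submatrix by averaging the squared deviation of each entry from what its row mean, column mean, and overall mean would predict, so a bicluster whose rows differ only by additive constants scores (numerically) zero.

```python
from statistics import fmean


def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix selected by rows x cols.

    Each entry's residue is: value - row mean - column mean + overall mean,
    where all means are taken inside the selected submatrix.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    row_means = [fmean(row) for row in sub]
    col_means = [fmean(col) for col in zip(*sub)]
    # All rows have the same length, so the mean of row means is the overall mean.
    overall = fmean(row_means)
    total = 0.0
    for r, row in enumerate(sub):
        for c, value in enumerate(row):
            residue = value - row_means[r] - col_means[c] + overall
            total += residue * residue
    return total / (len(rows) * len(cols))


# Hypothetical toy data: rows are genes, columns are experimental conditions.
expression = [
    [1.0, 2.0, 4.0, 9.0],  # gene 0
    [3.0, 4.0, 6.0, 0.5],  # gene 1: gene 0 shifted by +2 on conditions 0..2
    [0.0, 1.0, 3.0, 7.0],  # gene 2: gene 0 shifted by -1 on conditions 0..2
    [5.0, 0.0, 8.0, 2.0],  # gene 3: unrelated
]

# Coherent (additive) bicluster: genes 0..2 under conditions 0..2.
additive_score = mean_squared_residue(expression, [0, 1, 2], [0, 1, 2])
# The full matrix is not coherent, so its score is clearly larger.
whole_score = mean_squared_residue(expression, [0, 1, 2, 3], [0, 1, 2, 3])

print(additive_score, whole_score)
```

A greedy search in this style would repeatedly drop the row or column that contributes most to the score until it falls below a chosen threshold; that loop is omitted here because the excerpt does not describe its details.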


Similar resources

Efficient Mining Differential Co-Expression Constant Row Bicluster in Real-Valued Gene Expression Datasets

Biclustering aims to mine a number of co-expressed genes under a set of experimental conditions in gene expression dataset. Recently, differential co-expression biclustering approach has been used to identify class-specific biclusters between two gene expression datasets. However, it cannot handle differential co-expression constant row biclusters efficiently in real-valued datasets. In this pa...


Efficient Mining Maximal Variant Usage and Low Usage Biclusters in Discrete Function-Resource Matrix

The functional layer is the pillar of the whole prognostics and health management system. Its effectiveness is the core of system task effectiveness. In this paper, we propose a new bicluster mining algorithm, DoCluster, to effectively mine all biclusters with maximal variant usage rate and low usage rate in the discrete function-resource matrix. In order to improve the mining efficiency, DoClust...


Efficiently Mining Closed Subsequences with Gap Constraints

Mining frequent subsequence patterns from sequence databases is a typical data mining problem, and various efficient sequential pattern mining algorithms have been proposed. In many problem domains (e.g., biology), the frequent subsequences confined by predefined gap requirements are more meaningful than general sequential patterns. In this paper we re-examine the closed sequential patter...


CLOLINK: An Adapted Algorithm for Mining Closed Frequent Itemsets

Mining of the complete set of frequent itemsets will lead to a huge number of itemsets. Fortunately, this problem can be reduced to the mining of closed frequent itemsets, which results in a much smaller number of itemsets. Methods for efficient mining of closed frequent itemsets have been studied extensively by many researchers using various strategies to prove their efficiencies such as Aprio...


High Fuzzy Utility Based Frequent Patterns Mining Approach for Mobile Web Services Sequences

High fuzzy utility based pattern mining is an emerging topic in data mining. It refers to discovering all patterns whose utility meets a user-specified minimum high utility threshold. It comprises extracting patterns which are highly accessed in mobile web service sequences. Different from the traditional fuzzy approach, high fuzzy utility mining considers not only counts of mob...



Journal title:
  • IJKDB

Volume 1, Issue 4

Pages 69-88

Publication date: 2010